An Algorithm For Identifying Cognates Between Related Languages

Author

  • Jacques B. M. Guy
Abstract

The algorithm takes as its only input a list of words, preferably but not necessarily in phonemic transcription, in any two putatively related languages, and sorts it into decreasing order of probable cognation. The processing of a 250-item bilingual list takes about five seconds of CPU time on a DEC KL1091, and requires 56 pages of core memory. The algorithm is given no information whatsoever about the phonemic transcription used, and even though cognate identification is carried out on the basis of a context-free one-for-one matching of individual characters, its cognation decisions are bettered by a trained linguist using more information only in cases of wordlists sharing less than 40% cognates and involving complex, multiple sound correspondences.

I FUNDAMENTAL PROCEDURES

A. Identifying Sound Correspondences

Consider the following wordlist from two hypothetical Austronesian-like languages:
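The wordlist referred to above is not included in this excerpt. As a rough illustration of the idea the abstract describes (ranking word pairs using only context-free, one-for-one character matching, with no knowledge of the transcription), the Python sketch below may help. It is not the paper's procedure: the position-by-position alignment, the add-one smoothed log-ratio score, the function name `rank_by_cognation`, and the toy wordlist are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def rank_by_cognation(pairs):
    """Sort (word_a, word_b) pairs by a crude cognation score, highest first.

    Assumption: characters are matched position by position, which is a
    simplification and not necessarily how the paper matches them.
    """
    pair_counts = Counter()   # how often character x sits opposite character y
    left_counts = Counter()   # how often x occurs in a matched position (language A)
    right_counts = Counter()  # how often y occurs in a matched position (language B)
    for a, b in pairs:
        for x, y in zip(a, b):
            pair_counts[x, y] += 1
            left_counts[x] += 1
            right_counts[y] += 1
    total = sum(pair_counts.values())

    def association(x, y):
        # Log ratio of observed to chance-expected co-occurrence, add-one smoothed.
        expected = left_counts[x] * right_counts[y] / total
        return log((pair_counts[x, y] + 1) / (expected + 1))

    def score(a, b):
        matched = list(zip(a, b))
        if not matched:
            return float("-inf")  # nothing to compare: rank last
        return sum(association(x, y) for x, y in matched) / len(matched)

    return sorted(pairs, key=lambda p: score(*p), reverse=True)


# Invented toy wordlist (not from the paper): t in A regularly matches s in B.
wordlist = [("tapa", "sapa"), ("tuli", "suli"), ("mata", "masa"), ("kilo", "bwrx")]
for word_a, word_b in rank_by_cognation(wordlist):
    print(word_a, word_b)
```

Pairs whose character correspondences recur across the list score higher than pairs built from one-off matches, so the output is ordered by decreasing score, in the spirit of the sorted list the abstract describes.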

Similar resources

Building a Dataset of Multilingual Cognates for the Romanian Lexicon

Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymol...

Identifying Cognates by Phonetic and Semantic Similarity

I present a method of identifying cognates in the vocabularies of related languages. I show that a measure of phonetic similarity based on multivalued features performs better than “orthographic” measures, such as the Longest Common Subsequence Ratio (LCSR) or Dice’s coefficient. I introduce a procedure for estimating semantic similarity of glosses that employs keyword selection and WordNet. Te...

Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models

We present a system for computing similarity between pairs of words. Our system is based on Pair Hidden Markov Models, a variation on Hidden Markov Models that has been used successfully for the alignment of biological sequences. The parameters of the model are automatically learned from training data that consists of word pairs known to be similar. Our tests focus on the identification of cogn...

A Dictionary-Based Approach for Evaluating Orthographic Methods in Cognates Identification

In this paper we propose a method for identifying cognates based on etymology and etymons. We employ this approach to evaluate the extent to which lexical similarity can be used for automatic detection of cognate pairs. We investigate some orthographic approaches widely used in this research area and some original metrics as well. We apply this procedure for Romanian and its most closely relate...

Identifying Complex Sound Correspondences in Bilingual Wordlists

The determination of recurrent sound correspondences between languages is crucial for the identification of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from bitexts is shown to be capable of determining complex sound correspondences in bilingual wordlists. I...

Publication date: 1984